fix: support large index in forest car #6690
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings.
Walkthrough
Widened index size types to u64, added a ZstdSkipFramesEncodedDataReader and an async writer path write_zstd_skip_frames_into (supports single/multiple skip-frames), switched the encoder and bench to the new writer, and changed several conversions to return Results to propagate errors.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Encoder as Encoder
    participant Writer as ZstdSkipFramesWriter
    participant Sink as AsyncWrite
    participant IndexReader as IndexReader
    Encoder->>Writer: construct writer with index data
    Encoder->>Writer: call write_zstd_skip_frames_into(sink)
    Writer->>Sink: write skip-frame header (frame len)
    Writer->>Sink: write frame chunk (<=128KiB)
    alt multi-frame
        Writer->>Sink: write next skip-frame header
        Writer->>Sink: write next frame chunk
    end
    Writer-->>Encoder: return io::Result
    Note right of IndexReader: exposes inner(), Size and ReadAt for reads
```
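The write path in the diagram boils down to emitting zstd skippable frames. A minimal stand-alone sketch of that layout, assuming the standard skippable-frame magic `0x184D2A50` and the 128 KiB chunk cap shown above; `encode_skip_frames` is an illustrative helper, not the PR's actual API:

```rust
// Illustrative sketch of the skip-frame layout in the diagram above.
// Assumptions: the standard zstd skippable-frame magic 0x184D2A50 and a
// 128 KiB chunk cap; `encode_skip_frames` is a made-up helper, not PR API.
const SKIPPABLE_FRAME_MAGIC: u32 = 0x184D_2A50;
const CHUNK_FRAME_DATA_MAX_BYTES: usize = 128 * 1024;

/// Split `data` into chunks and prefix each with an 8-byte skippable-frame
/// header: 4-byte little-endian magic followed by the 4-byte LE payload length.
fn encode_skip_frames(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in data.chunks(CHUNK_FRAME_DATA_MAX_BYTES) {
        out.extend_from_slice(&SKIPPABLE_FRAME_MAGIC.to_le_bytes());
        out.extend_from_slice(&(chunk.len() as u32).to_le_bytes());
        out.extend_from_slice(chunk);
    }
    out
}

fn main() {
    let encoded = encode_skip_frames(&[0xAB; 10]);
    assert_eq!(encoded.len(), 8 + 10); // one 8-byte header + 10 payload bytes
    assert_eq!(&encoded[..4], &SKIPPABLE_FRAME_MAGIC.to_le_bytes());
    assert_eq!(&encoded[4..8], &10u32.to_le_bytes());
    println!("single frame: {} bytes", encoded.len());
}
```

Because skippable frames are ignored by zstd decoders, data laid out this way stays a valid zstd stream, which is what makes the multi-frame path backward compatible.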
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Force-pushed: bd36372 to 708229f
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/db/car/forest/index/mod.rs (1)
195-220: Add rustdoc for the new public reader type and methods.
`ZstdSkipFramesEncodedDataReader` and its public API should be documented for maintainability and correct usage expectations. As per coding guidelines: "Document all public functions and structs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest/index/mod.rs` around lines 195 - 220, Add Rustdoc comments for the public struct ZstdSkipFramesEncodedDataReader and its public methods (new, inner, into_inner): document the purpose of the reader, the meaning of skip_frame_header_offsets, the behavior of new (what it reads from the provided ReadAt, how offsets are computed, and that it returns io::Result on failure), what inner returns (borrowed reader) and what into_inner returns (consumes and returns the inner reader). Keep the docs concise, include parameter/return descriptions and any error/edge-case notes (e.g., empty input or partial frames), and add a short usage example or note about thread-safety if applicable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/db/car/forest.rs`:
- Around line 151-152: The subtraction using footer-derived offsets (e.g.,
computing index_start_pos = footer.index - ZSTD_SKIP_FRAME_LEN and the similar
subtraction at the other location) must use checked arithmetic to avoid
underflow/wrap; update the code that computes index_start_pos and the related
subtraction at the other occurrence to use checked_sub (or an equivalent) and
map None to an io::Error/InvalidData return (or early Err) with a clear message
indicating corrupted footer-derived offsets; reference footer.index,
ZSTD_SKIP_FRAME_LEN, and the functions/methods that consume index_start_pos so
callers get an InvalidData error instead of panicking or wrapping.
In `@src/db/car/forest/index/mod.rs`:
- Around line 245-271: The read_at implementation currently scans
skip_frame_header_offsets linearly for each call, causing O(n) per read; change
it to a binary search using partition_point (or slice::binary_search_by /
slice::partition_point) on skip_frame_header_offsets to find the count/index of
offsets <= pos in O(log n), compute adjusted_pos = pos + (count as u64) *
ZSTD_SKIP_FRAME_LEN, obtain next_frame_pos from
skip_frame_header_offsets.get(count) if any, and then apply the existing logic
(compute max_read_len, call reader.read_at(adjusted_pos, ...), and recurse into
read_at for the remainder) using the found index instead of iterating. Ensure
types (usize/u64) are converted correctly and preserve the boundary-splitting
behavior when reads cross a skip-frame.
- Around line 482-489: The constant CHUNK_FRAME_DATA_MAX_BYTES is incorrectly
set to 128 * 1024 (128KiB) while the comment and intent describe 128MiB; update
CHUNK_FRAME_DATA_MAX_BYTES to represent 128 * 1024 * 1024 bytes (or 128 << 20)
so write_zstd_skip_frames_into_inner receives the correct chunk size, then keep
the call using self.written_len() and the u32::try_from(...) conversion as-is
(adjusting types if necessary) to avoid producing excessive skip frames.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d3235db5-8ea2-4f57-b14f-470f1e618472
📒 Files selected for processing (6)
benches/car-index.rs, src/db/car/any.rs, src/db/car/forest.rs, src/db/car/forest/index/mod.rs, src/db/car/many.rs, src/tool/subcommands/archive_cmd.rs
♻️ Duplicate comments (2)
src/db/car/forest/index/mod.rs (2)
245-265: ⚠️ Potential issue | 🟠 Major
`read_at` still performs linear skip-frame lookup per call. Line 248 scans `skip_frame_header_offsets` linearly on every read. With many frames, this becomes a major lookup bottleneck; switch to O(log n) frame mapping (binary search on precomputed logical boundaries).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest/index/mod.rs` around lines 245 - 265, read_at currently scans skip_frame_header_offsets linearly every call (see read_at, skip_frame_header_offsets, ZSTD_SKIP_FRAME_LEN, reader.read_at), causing O(n) lookup cost; replace that loop with a binary search over a precomputed vector of logical frame boundaries (e.g., frame_logical_starts or cumulative_adjusted_offsets computed once when building the structure) to find the frame index and the adjusted_pos in O(log n), then compute max_read_len from the next boundary and perform reader.read_at(adjusted_pos, &mut buf[..max_read_len]) followed by recursive/iterative continuation for the remainder; ensure correct u64/usize conversions and keep behavior identical to existing adjustments.
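The O(log n) mapping the prompt describes can be sketched with `slice::partition_point`. A hedged, self-contained illustration; the 8-byte `ZSTD_SKIP_FRAME_LEN` and the "count of offsets <= pos" convention are taken from the prompt above, while `adjusted_pos` is an illustrative name:

```rust
// Hedged sketch of the suggested binary-search lookup. Assumptions:
// an 8-byte skip-frame header and header offsets expressed in the
// logical (header-stripped) coordinate space, per the prompt's
// "count of offsets <= pos" formulation.
const ZSTD_SKIP_FRAME_LEN: u64 = 8;

/// Translate a logical read position into a physical one in O(log n)
/// by counting, via `partition_point`, how many headers precede `pos`.
fn adjusted_pos(skip_frame_header_offsets: &[u64], pos: u64) -> u64 {
    let count = skip_frame_header_offsets.partition_point(|&off| off <= pos);
    pos + (count as u64) * ZSTD_SKIP_FRAME_LEN
}

fn main() {
    let offsets = [0, 1_000, 2_000];
    assert_eq!(adjusted_pos(&offsets, 0), 8); // one header precedes pos 0
    assert_eq!(adjusted_pos(&offsets, 1_500), 1_516); // two headers skipped
    assert_eq!(adjusted_pos(&[], 42), 42); // no frames: identity
}
```

`partition_point` requires the offsets to be sorted, which holds here since they are discovered in file order; the same `count` also indexes `skip_frame_header_offsets.get(count)` to find the next frame boundary for bounding a single read.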
477-485: ⚠️ Potential issue | 🟠 Major
Chunk-size constant does not match the stated 128MiB intent. Line 479 uses `128 * 1024` (128KiB), which creates far more skip frames than intended and amplifies read-path overhead.
Proposed fix:

```diff
- const CHUNK_FRAME_DATA_MAX_BYTES: usize = 128 * 1024;
+ const CHUNK_FRAME_DATA_MAX_BYTES: usize = 128 * 1024 * 1024;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest/index/mod.rs` around lines 477 - 485, The CHUNK_FRAME_DATA_MAX_BYTES constant in write_zstd_skip_frames_into is set to 128 * 1024 (128KiB) but the intent is 128MiB; update CHUNK_FRAME_DATA_MAX_BYTES to 128 * 1024 * 1024 (or an equivalent expression like 128 << 20) inside write_zstd_skip_frames_into so the call to write_zstd_skip_frames_into_inner uses the correct 128MiB chunk size, leaving the rest of the function (including the call to write_zstd_skip_frames_into_inner and u32::try_from(written_len).ok()) unchanged.
🧹 Nitpick comments (1)
src/db/car/forest/index/mod.rs (1)
195-220: Add rustdoc for the new public reader API. Line 195 introduces a public struct and Lines 201, 214, and 218 expose public methods without documentation. Please add concise docs for expected input format and malformed-data behavior.
As per coding guidelines "Document all public functions and structs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest/index/mod.rs` around lines 195 - 220, Add Rustdoc comments for the public struct ZstdSkipFramesEncodedDataReader and its public methods new, inner, and into_inner: describe the expected input format (a sequence of skip-frame headers followed by ZSTD_SKIP_FRAME_LEN + payload length, with lengths read as little-endian u32 at offset+4), state that new parses the entire reader to build skip_frame_header_offsets, and document malformed-data behavior (what happens when read_u32_at fails — e.g., it returns an io::Result error from new — and that inner/into_inner simply return the wrapped reader without validation). Keep the docs concise, mention ownership/borrowing semantics for inner vs into_inner, and reference ZSTD_SKIP_FRAME_LEN for clarity.
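What the requested rustdoc might look like, shown on a simplified stand-in type: the field set, generics, and method bodies here are assumptions for illustration, while the real type wraps a `ReadAt` source and performs the offset translation:

```rust
// Sketch of the requested documentation on a simplified stand-in type;
// field set and method bodies are illustrative assumptions, not the
// actual forest-car implementation.

/// Presents data encoded as a sequence of zstd skippable frames as one
/// contiguous byte stream, translating reads past the 8-byte headers.
pub struct ZstdSkipFramesEncodedDataReader<R> {
    /// The underlying random-access source.
    reader: R,
    /// Offsets of each skippable-frame header, recorded at construction.
    skip_frame_header_offsets: Vec<u64>,
}

impl<R> ZstdSkipFramesEncodedDataReader<R> {
    /// Borrows the wrapped reader without consuming the wrapper.
    pub fn inner(&self) -> &R {
        &self.reader
    }

    /// Consumes the wrapper and returns the wrapped reader.
    pub fn into_inner(self) -> R {
        self.reader
    }
}

fn main() {
    let r = ZstdSkipFramesEncodedDataReader {
        reader: vec![1u8, 2, 3],
        skip_frame_header_offsets: vec![0],
    };
    assert_eq!(r.inner().len(), 3);
    assert_eq!(r.skip_frame_header_offsets.len(), 1);
    assert_eq!(r.into_inner(), vec![1, 2, 3]);
}
```

The doc comments above cover the three things the nitpick asks for: purpose, the meaning of the offsets field, and the inner/into_inner ownership semantics; error behavior of `new` would be documented on the real constructor.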
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9d79aa75-859e-4040-b22d-457ff296ae51
📒 Files selected for processing (1)
src/db/car/forest/index/mod.rs
Force-pushed: e6c8c27 to ce346cb
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/db/car/forest/index/mod.rs (1)
245-265: ⚠️ Potential issue | 🟠 Major
`read_at` remains linear per call and can bottleneck large multi-frame indices. Line 248 iterates all skip-frame offsets, and Line 264 recurses, so repeated reads across many frames can degrade heavily. Please switch to indexed lookup (e.g., binary-search-based frame mapping) to keep lookup cost predictable at scale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest/index/mod.rs` around lines 245 - 265, The read_at implementation on read_at is doing a linear scan over skip_frame_header_offsets and recursive reads, causing O(n) per call and worst-case O(n^2) across many frames; replace the linear iteration/recursion with an indexed lookup (e.g., binary_search on skip_frame_header_offsets) to map pos→frame quickly, compute adjusted_pos once using the found index and ZSTD_SKIP_FRAME_LEN, then perform a bounded single reader.read_at call (reader.read_at) for the chunk within that frame and loop/iterate (not recurse) to continue for remaining bytes—update read_at to use binary_search (or equivalent) over skip_frame_header_offsets to find next_frame_pos and avoid per-call linear scans and recursion.
src/db/car/forest.rs (1)
166-166: ⚠️ Potential issue | 🟠 Major
Guard index range computation with checked subtraction. Line 166 still performs unchecked `index_end_pos - index_start_pos`. Corrupt footer data can make this underflow; return `InvalidData` instead of panicking/wrapping.
Proposed fix:

```diff
- Ok((header, index_start_pos, index_end_pos - index_start_pos))
+ let index_size_bytes = index_end_pos.checked_sub(index_start_pos).ok_or_else(|| {
+     invalid_data(format!(
+         "invalid footer offsets: index_start_pos({index_start_pos}) > index_end_pos({index_end_pos})"
+     ))
+ })?;
+ Ok((header, index_start_pos, index_size_bytes))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest.rs` at line 166, The subtraction index_end_pos - index_start_pos can underflow on corrupt footer data; replace the unchecked subtraction with a checked subtraction (e.g., index_end_pos.checked_sub(index_start_pos)) and if it returns None return an Err with InvalidData (std::io::Error::new(ErrorKind::InvalidData, ...)) instead of producing a wrapped/negative value; update the Ok((header, ...)) return to use the checked difference.
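The hardening described above follows the standard `checked_sub` pattern. A self-contained sketch using only std; `invalid_data` is inlined here as a stand-in for whatever error helper the crate actually uses:

```rust
use std::io;

// Stand-in for the crate's error helper; an assumption for illustration.
fn invalid_data(msg: String) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Compute the index size from footer-derived offsets, rejecting corrupt
/// footers (start past end) instead of underflowing u64 arithmetic.
fn index_size_bytes(index_start_pos: u64, index_end_pos: u64) -> io::Result<u64> {
    index_end_pos.checked_sub(index_start_pos).ok_or_else(|| {
        invalid_data(format!(
            "invalid footer offsets: index_start_pos({index_start_pos}) > index_end_pos({index_end_pos})"
        ))
    })
}

fn main() {
    assert_eq!(index_size_bytes(100, 164).unwrap(), 64);
    let err = index_size_bytes(164, 100).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::InvalidData);
    println!("checked_sub guard ok");
}
```

Since the offsets come straight from on-disk bytes, the `InvalidData` error kind lets callers distinguish a corrupt footer from genuine I/O failures.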
🧹 Nitpick comments (1)
src/db/car/forest/index/mod.rs (1)
195-220: Add rustdoc for the new public reader API.
`ZstdSkipFramesEncodedDataReader` and its public methods are exported without documentation. Please add concise docs for expected input format, invariants, and failure modes. As per coding guidelines, "Document all public functions and structs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest/index/mod.rs` around lines 195 - 220, Add Rustdoc comments for the public ZstdSkipFramesEncodedDataReader struct and its public methods (new, inner, into_inner): describe the expected input format (sequence of skip-frame headers followed by frames, how header length is read via read_u32_at with LittleEndian and the use of ZSTD_SKIP_FRAME_LEN), list invariants (skip_frame_header_offsets contains offsets discovered during construction, reader must implement ReadAt, the reader is not validated beyond iterating until read_u32_at fails), and document failure modes (new returns io::Error if underlying read_u32_at fails unexpectedly, or stops normally when EOF/invalid header occurs; callers should handle potential partial/invalid data), plus any ownership/borrowing behavior (inner returns a reference; into_inner consumes and returns the inner reader). Ensure doc comments are concise, placed above the struct and each method, and follow crate doc-style guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/db/car/forest.rs`:
- Around line 191-201: The indexed remapping in into_dyn loses the original
bounded length by passing None as the slice length; preserve the original index
bound by extracting the index size from the original slice (e.g., call
slice.inner().index_size_bytes() or the appropriate accessor on slice.inner())
and pass Some(index_size_bytes) instead of None to positioned_io::Slice::new
when constructing ZstdSkipFramesEncodedDataReader in ForestCar::into_dyn so the
remapped reader only exposes the index bytes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b95605f5-6514-4db4-913c-f403adbaebdb
📒 Files selected for processing (2)
src/db/car/forest.rs, src/db/car/forest/index/mod.rs
Codecov Report ❌ Patch coverage is …
Additional details and impacted files:
... and 12 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
♻️ Duplicate comments (1)
src/db/car/forest.rs (1)
166-166: ⚠️ Potential issue | 🟠 Major
Guard footer-derived index size math with `checked_sub`. Line 166 still does unchecked subtraction on on-disk values. Corrupt footer data can underflow and panic/wrap instead of returning `InvalidData`.
Proposed hardening:

```diff
- Ok((header, index_start_pos, index_end_pos - index_start_pos))
+ let index_size_bytes = index_end_pos
+     .checked_sub(index_start_pos)
+     .ok_or_else(|| invalid_data("invalid footer: index start is after index end"))?;
+ Ok((header, index_start_pos, index_size_bytes))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest.rs` at line 166, The subtraction index_end_pos - index_start_pos is unchecked and can underflow on corrupt footer data; replace it with a checked subtraction (use index_end_pos.checked_sub(index_start_pos)) and if it returns None return an Err(io::Error::new(ErrorKind::InvalidData, ...)) instead of panicking—update the return for Ok((header, index_start_pos, ...)) to use the validated size and ensure the error path yields InvalidData; refer to the symbols header, index_start_pos, and index_end_pos to locate the change.
🧹 Nitpick comments (1)
src/db/car/forest.rs (1)
179-192: Add rustdoc for public APIs touched in this change.
`index_size_bytes` and `into_dyn` are public and currently undocumented. As per coding guidelines, "Document all public functions and structs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/car/forest.rs` around lines 179 - 192, Add concise rustdoc comments for the public methods index_size_bytes and into_dyn (and the public ForestCar type if not already documented): describe what value index_size_bytes returns (units and meaning), and explain into_dyn’s behavior and return type (that it converts the ForestCar into a dynamic boxed RandomAccessFileReader wrapped in io::Result). Place the docs immediately above the fn signatures (index_size_bytes, into_dyn) and ensure they follow Rust doc conventions (///) and mention error conditions for into_dyn if any.
Force-pushed: ff6f458 to 8e9078c
Summary of changes
Changes introduced in this pull request:
- size > `u32::MAX` or `FOREST_CAR_INDEX_USE_MULTIPLE_SKIP_FRAMES=1` to keep backward compatibility
Reference issue to close (if applicable)
Closes #6673
Other information and links
Change checklist
Outside contributions
Summary by CodeRabbit
Bug Fixes
Improvements